This report outlines the results of a text mining analysis of the best-selling non-fiction book Invisible Women by Caroline Criado Perez. The exploration is meant to better understand some textual characteristics of feminist texts on gender bias in data. Our text mining analysis shows that [ADD RESULTS HERE].
The goal of the text mining analysis is to study textual data and extract some specific characteristics of feminist texts on gender bias in data by transforming unstructured texts into a set of structured features.
The upcoming analysis is a comprehensive text mining exploration of Invisible Women following a four-part structure: [TO UPDATE]
Data Gathering
Data Structuring and Cleaning
Our research questions are the following: [TO UPDATE]
Invisible Women is an award-winning best-seller published in 26 languages that sold 122,255 copies less than a month after its release in March 2019. The book attracted immediate attention from the public and the media, both struck by its disclosure of the inherent data bias in a world designed for men [REFER THIS AS A QUOTE USING HW1 OF PROGRAMMING TOOLS]. In fact, the book exposes the tremendous number of situations in which decision-makers use a generic male default to implement public policies without considering or recognizing the seemingly not-so-obvious fact that what works best for men does not necessarily work best for women.
The public’s instantaneous response to the release of the best-seller translated into a myriad of prizes. From the Royal Society Insight Investment Science Book Prize, the FT & McKinsey Business Book of the Year and the Reader’s Choice Books Are My Bag Award to the Times Current Affairs Book of the Year, Invisible Women achieved unanimity among its audience.
We chose to proceed to a text mining analysis of the content of this book because one of the team members had recently read it for a book club and suggested diving deeper into the hot topic of gender bias in data. Considering the orientation we chose to specialize in - Business Analytics - and the amount of time invested in learning about data and perfecting our data science skills, we realized that we were missing one perspective: data analysis through a gender lens. The analysis of this book therefore allows us to kill two birds with one stone: acquiring new data science skills (i.e. text mining) at the edge of Artificial Intelligence (AI) and studying (text) data from a gender perspective.
Here are the book’s characteristics:
| Title | Author | Date | Parts | Chapters | Pages |
|---|---|---|---|---|---|
| Invisible Women | Caroline Criado Perez | 2019 | 6 | 16 | 399 |
The book Invisible Women under study is directly downloaded in its PDF version from an online source: https://yes-pdf.com/book/113#google_vignette. To load it into RStudio, we use the pdf_text utility from the pdftools package, which extracts text from PDF files. The advantage of downloading the book from a website link is that it is easy to obtain; on the other hand, it requires tedious preparation before the cleaning can get started. Here, we first need to manipulate the PDF version to keep only the chapters of the book and remove parts that are useless for our analysis, such as blank pages, the title page, etc.
First of all, we need to gather the parts of the book useful to our analysis (i.e. the chapters) in a corpus object. Since the book contains extra material such as a title page, preface and table of contents, we first need to clean the PDF version so that only the text of each chapter, associated with its title, remains. Therefore, after indicating exactly where the text under study starts and ends, extracting the chapter titles and organizing the text by chapters, we obtain usable data in a corpus to further analyze Invisible Women’s content.
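The trimming-and-splitting step described above can be sketched as follows. The report’s pipeline is in R (pdftools/quanteda); this is a language-agnostic illustration in Python on a made-up stand-in string, not the actual extraction code:

```python
import re

# Toy stand-in for the raw extracted PDF text (the real pipeline uses
# pdftools::pdf_text in R); the content below is invented for illustration.
raw = (
    "Title Page\nPreface\n"
    "CHAPTER 1 Can Snow-Clearing be Sexist? It all started as a joke.\n"
    "CHAPTER 2 Gender Neutral With Urinals In April 2017 ...\n"
)

# Keep only the text from the first chapter onwards, then split on headings.
body = raw[re.search(r"CHAPTER 1", raw).start():]
chunks = re.split(r"(CHAPTER \d+)", body)[1:]   # alternating heading / text
chapters = {chunks[i]: chunks[i + 1].strip() for i in range(0, len(chunks), 2)}

print(list(chapters))   # ['CHAPTER 1', 'CHAPTER 2']
```

The result is a chapter-indexed mapping, i.e. exactly the document/text structure shown in the table below.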
The below output shows the beginning of the first five chapters of the book. Each chapter is referred to as a document and its content as a text. We also notice that the chapters are organized in six parts.
| document | text | part |
|---|---|---|
| CHAPTER 1 | Can Snow-Clearing be Sexist? It all starte… | 1 |
| CHAPTER 2 | Gender Neutral With Urinals In April 2017 … | 1 |
| CHAPTER 3 | The Long Friday By the end of … | 2 |
| CHAPTER 4 | The Myth of Meritocracy For most of th… | 2 |
| CHAPTER 5 | The Henry Higgins Effect When Facebook… | 2 |
Tokenization is the method used to split a text into tokens. Here, we tokenize the chapters (i.e. documents) by whitespace. In doing so, we remove numbers, punctuation, symbols and separators because we believe they will not affect our analysis. Note that our unit of analysis is the word.
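The tokenization just described can be sketched in a few lines. This is a simplified Python illustration of what quanteda’s tokens() does with remove_numbers and remove_punct enabled, not the report’s actual R code (in particular, real tokenizers handle hyphens and unicode more carefully):

```python
import re

def tokenize(text):
    # Drop digits, then punctuation/symbols, then split on whitespace.
    cleaned = re.sub(r"[0-9]", " ", text)
    cleaned = re.sub(r"[^\w\s]|_", " ", cleaned)
    return cleaned.split()

print(tokenize("In April 2017, snow-clearing hit Karlskoga!"))
# ['In', 'April', 'snow', 'clearing', 'hit', 'Karlskoga']
```

Note that stripping punctuation also splits hyphenated compounds such as “snow-clearing” into two tokens.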
The Quanteda package uses a corpus object.
The below summary shows that Invisible Women consists of 16 documents (i.e. chapters) and, for each document, three columns indicate the number of token types, the number of tokens and the number of sentences.
#> Corpus consisting of 16 documents, showing 16 documents:
#>
#> Text Types Tokens Sentences
#> text1 1703 6340 185
#> text2 1884 7515 196
#> text3 2025 8428 206
#> text4 1866 7383 224
#> text5 1650 5829 166
#> text6 1584 5609 196
#> text7 1320 4385 97
#> text8 1306 4170 121
#> text9 2159 8775 322
#> text10 1982 7882 168
#> text11 1806 6771 180
#> text12 1445 5267 132
#> text13 1198 3871 116
#> text14 2000 7935 240
#> text15 851 2375 78
#> text16 1481 4847 98
To continue the cleaning process, we remove stop words, i.e. words that bring little to no information, using the stop-word list from the quanteda package, and we convert all tokens to lower case, since names (such as first or last names) are of no specific importance in this book.
The advantage of removing stop words is that it reduces the number of features/terms to analyze, so that the analysis focuses on terms that carry relevant information. To this end, we also remove the word “chapter”, which does not provide any value.
Lemmatization simplifies tokens by mapping each word to its dictionary form (its lemma), reducing the vocabulary to its simplest and meaningful essence. Consequently, the set of token types in the corpus shrinks. For example, “started” and “starts” are both reduced to “start”, which is thus their lemma.
The below output displays for each chapter the lemmas of the first tokens as well as the total number of different lemmas by chapter. For example, chapter one contains 2,413 different lemmas.
#> Tokens consisting of 3 documents.
#> text1 :
#> [1] "snow" "clear" "sexist" "start" "joke"
#> [6] "official" "town" "karlskoga" "sweden" "hit"
#> [11] "gender" "equality"
#> [ ... and 2,401 more ]
#>
#> text2 :
#> [1] "gender" "neutral" "urinal" "april" "veteran"
#> [6] "bbc" "journalist" "samira" "ahmed" "toilet"
#> [11] "screen" "negro"
#> [ ... and 2,755 more ]
#>
#> text3 :
#> [1] "friday" "day" "october" "icelandic"
#> [5] "friday" "supermarket" "sell" "sausage"
#> [9] "favourite" "ready" "meal" "time"
#> [ ... and 3,091 more ]
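Conceptually, lemmatization is a dictionary lookup. The toy five-entry lemma table below is invented for illustration; real lemmatization relies on a full lexicon (in R, e.g. tokens_replace() with a lemma table):

```python
# Toy lemma dictionary (made up for illustration; a real lexicon has
# tens of thousands of entries).
lemmas = {"started": "start", "starts": "start", "women": "woman",
          "data": "datum", "urinals": "urinal"}

def lemmatize(tokens):
    # Look each token up; fall back to the token itself when unknown.
    return [lemmas.get(t, t) for t in tokens]

print(lemmatize(["snow", "clearing", "started", "women", "data"]))
# ['snow', 'clearing', 'start', 'woman', 'datum']
```

This lookup also explains why “data” appears as “datum” throughout the frequency analyses below.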
Stemming also simplifies tokens, by reducing a word to its stem with a simple rule-based algorithm using the tokens_wordstem() function. Like lemmatization, stemming reduces the size of the vocabulary, but in an inconsistent way: reducing a word to its stem does not guarantee meaningful tokens (e.g. official is reduced to offici). This is why, since the interpretation of the tokens matters, we decide not to use stemming in the rest of our analysis and only apply it here to demonstrate its purpose.
The below output displays, for each document, the first twelve tokens reduced to their stems. For example, “snow-clearing” was reduced to “snow-clear”.
#> Tokens consisting of 3 documents.
#> text1 :
#> [1] "snow" "clear" "sexist" "start" "joke"
#> [6] "offici" "town" "karlskoga" "sweden" "hit"
#> [11] "gender" "equal"
#> [ ... and 2,401 more ]
#>
#> text2 :
#> [1] "gender" "neutral" "urin" "april" "veteran"
#> [6] "bbc" "journalist" "samira" "ahm" "toilet"
#> [11] "screen" "negro"
#> [ ... and 2,755 more ]
#>
#> text3 :
#> [1] "friday" "day" "octob" "iceland"
#> [5] "friday" "supermarket" "sell" "sausag"
#> [9] "favourit" "readi" "meal" "time"
#> [ ... and 3,091 more ]
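The rule-based idea behind stemming can be sketched with a deliberately crude suffix-stripper. This is not the Porter stemmer used by tokens_wordstem(), which applies many ordered rules, but it reproduces the inconsistency discussed above:

```python
def crude_stem(token):
    # Minimal rule-based stemmer: strip the first matching common suffix.
    # Real stemmers (e.g. Porter) apply many ordered, conditional rules.
    for suffix in ("ational", "ity", "ing", "ed", "al", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

print([crude_stem(t) for t in ["official", "started", "equality", "sausages"]])
# ['offici', 'start', 'equal', 'sausage']
```

As with the real stemmer, some outputs are meaningful words (“start”, “equal”) while others are not (“offici”), which is precisely why we keep lemmas instead.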
Now, without considering the stemming, we compute the Document-Term-Matrix that will be useful throughout the analysis.
The below snapshot of the matrix indicates that, after cleaning and lemmatizing, there are 5,742 features to be analyzed and that the DTM is 83.76% sparse (i.e. contains mostly zeros). The matrix displays the frequency of features (i.e. terms, or words here) by documents (i.e. texts, or chapters here). For example, the first row indicates that the word sexist is found twice in chapter 1, and the first column indicates that the same word is found in chapters 1, 4 and 6.
#> Document-feature matrix of: 16 documents, 5,742 features (83.76% sparse) and 0 docvars.
#> features
#> docs sexist joke town karlskoga sweden hit initiative lens harsh
#> text1 2 1 6 5 3 1 1 1 1
#> text2 0 0 0 0 4 1 0 0 0
#> text3 0 0 0 0 7 3 0 0 0
#> text4 4 0 0 0 0 0 1 0 1
#> text5 0 0 0 0 0 0 0 0 0
#> text6 1 0 0 0 0 1 0 0 0
#> features
#> docs glare
#> text1 1
#> text2 0
#> text3 0
#> text4 0
#> text5 0
#> text6 0
#> [ reached max_ndoc ... 10 more documents, reached max_nfeat ... 5,732 more features ]
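Building a document-term matrix amounts to counting each feature in each document. The two tiny token lists below are made up for illustration; quanteda’s dfm() does the same at corpus scale:

```python
from collections import Counter

# Made-up token lists standing in for two cleaned chapters.
docs = {
    "text1": ["sexist", "joke", "town", "sexist", "sweden"],
    "text2": ["sweden", "gender", "urinal"],
}

# One row per document, one column per feature (sorted vocabulary).
features = sorted({t for toks in docs.values() for t in toks})
dtm = {doc: [Counter(toks)[f] for f in features] for doc, toks in docs.items()}

# Sparsity = share of zero cells, as reported by quanteda.
sparsity = sum(v == 0 for row in dtm.values() for v in row) / (len(dtm) * len(features))

print(features)          # ['gender', 'joke', 'sexist', 'sweden', 'town', 'urinal']
print(dtm["text1"])      # [0, 1, 2, 1, 1, 0]
print(round(sparsity, 2))  # 0.42
```

With 16 real chapters and 5,742 features, the same computation yields the 83.76% sparsity shown above.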
To proceed to the Exploratory Data Analysis (EDA), we use the quanteda package.
To start the EDA, we visually assess which words appear most regularly in the corpus. The CoW plot is a visual representation of term frequencies in which the size and position of terms are proportional to their frequencies. Nevertheless, this visualization is more graphic than informative, since the only information we can extract from it is that the terms with the largest font sizes are the most frequent in the corpus. From the below plot, we see that woman is the most used term in the corpus, followed by female, datum, male, find, time, gender and study.
To generate the CoW plot, we use the DTM that was obtained after cleaning and lemmatizing but without considering the stemming, see Document-Term Matrix in Section 2 Data Structuring and Cleaning.
To assess more accurately the frequency of the terms in the corpus, we compute the global frequencies. The following graphical representation displays the ten most frequent terms in the corpus. As inferred previously from the CoW, we see that woman is the most frequent term overall followed by female, datum, male, find, time, gender and study.
Global frequencies indicate that woman is by far the most frequent term and that the frequency differences between the following nine terms are much less extreme.
In addition, we see that the term data was lemmatized into datum.
To dive deeper into these frequencies, we then display a term-frequency (TF) table providing, for each term, its frequency, its rank and its document frequency: feature lists the lemmatized tokens, frequency gives the number of times the term is found in the corpus, rank sorts the terms by decreasing frequency, and docfreq indicates the number of documents in which the token is found.
The table below shows that woman appears 1,594 times in the corpus, four times more than the second most frequent term, female. Moreover, all these terms have a high document frequency.
| feature | frequency | rank | docfreq |
|---|---|---|---|
| woman | 1594 | 1 | 16 |
| female | 395 | 2 | 15 |
| datum | 358 | 3 | 16 |
| male | 334 | 4 | 16 |
| find | 298 | 5 | 16 |
| time | 260 | 6 | 16 |
| gender | 256 | 7 | 16 |
| study | 224 | 8 | 15 |
| gap | 176 | 9 | 16 |
| sex | 172 | 10 | 15 |
The below graph gives an overall graphical view of the previous table, indicating both the term frequency and the document frequency.
As observed previously, we see that woman is the most frequent term in the corpus and that it is the most frequent term over all chapters of the book. On the contrary, the terms trial and tax are less frequent overall and also less frequent in documents.
For all the reasons mentioned so far, we decide to remove the term woman, because we believe it can hide some important and interesting insights. For the rest of our analysis, we will indicate whenever woman is reintroduced.
The plot below shows much the same information as the frequency plot, but with a twist: here, we show the ten most frequent words associated with their chapter. It would have been interesting to show the top frequencies of each document, but with 16 chapters this would have caused information overload, so we decided not to display it.
Chapter 10, namely The Drugs Don’t Work, is associated with sex, drug and study. In the previous plot (TF versus DF), we saw that sex and study are not document-specific but that drug is. Therefore, we can assume that drug is more specific to chapter 10 than the two other terms. Nevertheless, we do not want to jump to conclusions right now; the document-specificity of terms will be explored in more depth later.
Chapter 13, From Purse to Wallet, seems to be associated only with tax and, like drug, tax is more document-specific.
Chapter 14, Women’s Rights are Human Rights, is only associated with female.
Chapter 3, The Long Friday, is related to pay, leave and time.
Lastly, chapter 4, The Myth of Meritocracy, is equally associated with female and male.
Even though some of these words are informative, others are much less insightful. Indeed, it is not enlightening to see female associated with one particular document, as the whole book is about feminism.
Zipf’s law describes the distribution of word frequencies in a corpus as a function of rank in the frequency table: the frequency of a token is inversely proportional to its rank. The below plot is on a log-log scale, on which frequency versus rank gives an approximately linear relation with negative slope, as the original distribution follows a power law.
This plot shows that female, datum, male, find and time are the most frequent terms of the corpus, probably indicating that they are not chapter-specific but frequent in all chapters. Although Zipf’s law says nothing about specificity, it is very unlikely that a word specific to a single chapter would be frequent enough overall to rank this high. According to Zipf’s law, these terms are therefore very frequent in the overall corpus and, since they are not stop words, could hide some meaningful information. This leads us to look at weighted frequencies.
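Zipf’s law predicts that log(frequency) is roughly linear in log(rank) with slope near -1. As a sketch, we can fit that slope on the top-10 frequencies from the TF table above (the least-squares formula here is standard; the data are from the report):

```python
import math

# Top-10 frequencies from the TF table (woman, female, datum, ...).
freqs = [1594, 395, 358, 334, 298, 260, 256, 224, 176, 172]
pairs = [(math.log(r), math.log(f)) for r, f in enumerate(freqs, start=1)]

# Least-squares slope of log(freq) on log(rank).
n = len(pairs)
mx = sum(x for x, _ in pairs) / n
my = sum(y for _, y in pairs) / n
slope = sum((x - mx) * (y - my) for x, y in pairs) / sum((x - mx) ** 2 for x, _ in pairs)
print(round(slope, 2))   # negative, on the order of -1, as Zipf's law predicts
```

A slope close to -1 on the log-log plot is the signature of the Zipfian power law discussed above.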
The TF-IDF matrix is a weighted document-feature matrix displaying term frequency–inverse document frequency. In other words, it is used to re-balance a term frequency with respect to its document-specificity.
The below output shows, for the document-feature matrix of 16 documents and 5,741 features, the weighted frequency of each token by document. The sparsity has slightly increased in comparison to the DTM, as woman, a term occurring in all documents, has been removed.
#> Document-feature matrix of: 16 documents, 5,741 features (83.77% sparse) and 0 docvars.
#> features
#> docs sexist joke town karlskoga sweden hit initiative lens
#> text1 0.718 0.903 5.42 6.02 1.08 0.204 0.505 0.903
#> text2 0 0 0 0 1.44 0.204 0 0
#> text3 0 0 0 0 2.51 0.612 0 0
#> text4 1.436 0 0 0 0 0 0.505 0
#> text5 0 0 0 0 0 0 0 0
#> text6 0.359 0 0 0 0 0.204 0 0
#> features
#> docs harsh glare
#> text1 0.903 0.903
#> text2 0 0
#> text3 0 0
#> text4 0.903 0
#> text5 0 0
#> text6 0 0
#> [ reached max_ndoc ... 10 more documents, reached max_nfeat ... 5,731 more features ]
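The weighting behind this matrix can be sketched directly. We assume the quanteda default for dfm_tfidf(), i.e. tf-idf = count × log10(N / docfreq); the toy counts below are made up:

```python
import math

# Made-up counts of one term across 4 documents (stand-in for a DTM column).
counts = {"text1": 2, "text2": 0, "text3": 0, "text4": 4}
N = len(counts)
df = sum(c > 0 for c in counts.values())   # document frequency = 2
idf = math.log10(N / df)                   # log10(4 / 2)

tfidf = {doc: round(c * idf, 3) for doc, c in counts.items()}
print(tfidf)   # {'text1': 0.602, 'text2': 0.0, 'text3': 0.0, 'text4': 1.204}
```

Terms occurring in every document get idf = log10(N/N) = 0, which is why a corpus-wide word like woman carries no weight here and why document-specific terms such as tax rise to the top below.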
The following plot shows the twenty largest TF-IDF values and their respective terms; for each term displayed, this is the maximum TF-IDF over all chapters of the book. The output shows that tax has the largest TF-IDF in at least one chapter of the book, and that trial, drug and dummy also appear quite often in the corpus while being concentrated in few documents.
The following plot shows the ten highest TF-IDF values associated with their chapter. To avoid redundancy, we will not comment on every term-document pair. For chapter 10, The Drugs Don’t Work, we can see that it is now associated with trial and drug. The term tax is still very frequent, and we can now definitely state that it is specific to chapter 13, From Purse to Wallet. Chapter 14, Women’s Rights are Human Rights, is in fact more specifically associated with the term interrupt than with the term female, as shown in the Term-Frequency by Document section.
[what’s the firm term ? vr ? ]
The keyness measure is a chi-square test of independence indicating whether some terms are characteristic of a target compared to a reference. To illustrate this, we select chapter 14, Women’s Rights are Human Rights, which has the rather vague specific term interrupt, and compute its keyness against the other chapters (i.e. the reference).
The plot below shows that this chapter is characterized by the terms party, politician, election and candidate, suggesting at first glance that this chapter is more about political topics than the rest of the corpus.
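Keyness boils down to a 2×2 chi-square: a term’s count in the target versus the reference, against all other tokens. The counts below are hypothetical, chosen only to show the mechanics:

```python
# Hypothetical 2x2 contingency table for one term:
a, b = 40, 10          # term count in target chapter / in reference chapters
c, d = 4807, 58000     # all other tokens in target / in reference

# Chi-square statistic for a 2x2 table (1 degree of freedom).
n = a + b + c + d
chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
print(chi2 > 3.84)     # True -> term is "key" at the 5% significance level
```

Terms whose statistic exceeds the critical value are plotted as characteristic of the target; quanteda’s textstat_keyness() computes this (with optional corrections) for every term at once.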
Then, we compute for each chapter of the book the keyness of terms in order to better understand what each chapter is about. The following visualization is an animated illustration (in a gif format). Each chapter is then at some point the target and the reference.
Here, we look at how words co-occur and how inter-connected they are and to do so, we first compute the co-occurrences between terms.
The feature co-occurrence matrix is a 5,741 by 5,741 matrix (32,959,081 elements) in which is displayed the number of times two terms co-occur in the corpus. Because of the large size of the matrix, we decide to reduce its size by keeping only co-occurrences greater than 110. The latter condition allows us to focus our attention on terms that appear the most together, implying that they have a specific connection of interest in the context of the book. After applying this condition to the matrix, we get the following smaller feature co-occurrence matrix of dimensions 20 by 20 features (400 elements).
#> Feature co-occurrence matrix of: 20 by 20 features.
#> features
#> features datum male find time gender study gap sex pay report
#> datum 4481 8548 6860 5544 5961 5258 4026 4303 3080 3346
#> male 8548 4854 7730 5167 6103 6348 3859 5317 2285 2995
#> find 6860 7730 3485 5218 4917 6060 3362 5104 3017 2810
#> time 5544 5167 5218 3000 4469 3644 3291 2252 5227 2250
#> gender 5961 6103 4917 4469 2529 3490 2884 2746 3065 2414
#> study 5258 6348 6060 3644 3490 2622 2456 5277 1636 2181
#> gap 4026 3859 3362 3291 2884 2456 1008 1864 2461 1410
#> sex 4303 5317 5104 2252 2746 5277 1864 3943 562 1794
#> pay 3080 2285 3017 5227 3065 1636 2461 562 2878 1144
#> report 3346 2995 2810 2250 2414 2181 1410 1794 1144 876
#> [ reached max_feat ... 10 more features, reached max_nfeat ... 10 more features ]
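Document-level co-occurrence counting can be sketched as follows. The two toy documents are made up, and the count-product weighting used here is one common convention (quanteda’s fcm() supports several):

```python
from collections import Counter
from itertools import combinations

# Made-up token lists standing in for two documents.
docs = [
    ["datum", "male", "find", "datum"],
    ["male", "find", "gender"],
]

# For each unordered term pair in a document, add the product of their counts.
cooc = Counter()
for toks in docs:
    counts = Counter(toks)
    for t1, t2 in combinations(sorted(counts), 2):
        cooc[(t1, t2)] += counts[t1] * counts[t2]

print(cooc[("datum", "male")])   # 2: datum appears twice alongside male in doc 1
print(cooc[("find", "male")])    # 2: the pair co-occurs once in each document
```

Thresholding this matrix (here, keeping pairs above 110 and then 2,100) is what turns it into the readable network below.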
Using the above feature co-occurrences matrix, we generate a network (object) displaying visually the inter-connections of interest between the co-occurring terms appearing more than 110 times in the corpus. To generate a readable network of co-occurrences, we add a second condition on the co-occurring features and we keep only the co-occurrences greater than 2,100.
The below network reveals that the terms datum, male and find are central and co-occur a lot with the surrounding terms. The output shows a surprising finding which is that for a book named Invisible Women, the term female is not a central term co-occurring the most with other terms. Therefore, to dig deeper into this finding, we decide to re-introduce the term woman and we find an even more surprising result which is that woman is still not at the center of the co-occurrences despite its significantly large frequency (1,594 out of 36,160 or 4.4% of all frequencies).
After investigating term co-occurrences, we look at how terms move together in the book. Dispersion or X-Ray plots inspect where a specific token is used in each text by locating a pattern in each text.
The below lexical dispersion plot shows how the terms female and male move along the chapters. First, male is found in all chapters, but only once in chapter 6, Being Worth Less Than A Shoe, whereas female is absent only from chapter 15, namely Who Will Rebuild. Considering the implication of this chapter’s title, this finding seems a bit curious. Second, these two terms often seem to appear together at some point in chapters. Third, both are present at higher frequency in chapters 3, The Long Friday, and 14, Women’s Rights are Human Rights, but not necessarily in the same locations, suggesting that the author compares the two more in these chapters.
After exploring the movements of female and male, we look at the movements between female and sex. Note that here sex is reduced to its lemma so it could refer to the gender, the nature of a relation or any other terms related to sexuality. The below plot shows that the term sex appears more in chapter 10 The Drugs Don’t Work and that female and sex are not necessarily associated.
Lastly, we explore the movements of male and sex. This following plot reveals that male and sex seem to be used together most often in Chapter 10 The Drugs Don’t Work. Furthermore, using previous results where we find that chapter 10 is associated with trial and drug, we can conclude that this chapter probably focuses on clinical trials and gender.
Lexical diversity is a diversity index that measures the richness of the vocabulary in one document.
The TTR (type-token ratio) is a diversity measure indicating a document’s richness in token types. The more token types found in a document, the richer its vocabulary; the closer the TTR is to 1, the richer the vocabulary. We need to be careful with the TTR, because it depends on the length of the document. TTR is computed using the document-term matrix.
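The computation itself is a one-liner, sketched here on a made-up token list:

```python
def ttr(tokens):
    # Type-token ratio: distinct token types over total tokens.
    return len(set(tokens)) / len(tokens)

# Made-up token list: 3 types over 6 tokens.
print(ttr(["woman", "datum", "woman", "gap", "datum", "woman"]))   # 0.5
```

Because the denominator is the raw token count, longer documents mechanically tend toward lower TTRs, which is the length dependence cautioned about above.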
The below graph sorts chapters of the book by descending TTR. According to TTR, chapter 15 Who Will Rebuild? has the richest vocabulary among all chapters of the book with a TTR of 0.592 and chapter 3 The Long Friday has the poorest with a TTR of 0.374. Overall, the richness of vocabulary is not very diverse which could be explained by the fact that the author focuses on the specific gender issue, therefore using repetitively gender-specific terms.
The Moving-Average Type-Token Ratio (MATTR) averages the TTR over windows: the algorithm computes the TTR on successive windows of the same size over the text and averages the results. The advantage of the MATTR is that it is less dependent on document length than the TTR. Note that too large a window can produce an error, since no local TTR can be computed, while too small a window yields meaningless values (always close to 1). MATTR is computed on ordered tokens.
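The sliding-window averaging can be sketched as follows, on an artificially repetitive token stream invented for the example:

```python
def ttr(tokens):
    return len(set(tokens)) / len(tokens)

def mattr(tokens, window=80):
    # Slide a fixed-size window over the token stream, average the local TTRs.
    if len(tokens) < window:
        raise ValueError("window larger than the document")
    locals_ = [ttr(tokens[i:i + window]) for i in range(len(tokens) - window + 1)]
    return sum(locals_) / len(locals_)

# 200 highly repetitive tokens: every 80-token window holds exactly 4 types.
toks = ["woman", "datum", "gap", "pay"] * 50
print(round(mattr(toks, window=80), 3))   # 0.05
```

Because every window has the same size, document length no longer drives the score, only local vocabulary variety does.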
Considering the documents’ lengths, we decide to compute the MATTR with a window of 80 tokens. The below graph shows that the MATTR is very similar across chapters, ranging from 0.725 to 0.798. According to the MATTR, chapter 15 Who Will Rebuild? still has the richest vocabulary, with an MATTR of 0.798.
This section dives deeper into the analysis of the content of the corpus in which each part is supported by well-designed relevant charts and graphs using the ggplot2, sentimentr, reshape2, quanteda.textmodels, seededlda and text2vec packages.
The previous exploratory data analysis sheds light on overall and document-specific findings related to terms and vocabulary used throughout the book. Indeed, following a top-down approach, we examine terms specific to the general corpus which unsurprisingly reveal a gender-centric vocabulary and then look more closely at the focus of each chapter on the gender discrimination issue on which Invisible Women devotes all its attention.
Looking at the complexity of the book, Invisible Women is built of 16 chapters divided into six parts and tells its story over 280 pages. For the purpose of the following analysis, we take care to discuss the complexity of the book: [not sure about the point of this sentence??] The length of the document without any cleaning is 97,382 tokens. With the cleaning process, we keep less than 40% of the tokens present in the document, for a length of 36,160. Since the vast majority of the removed words bring little to no information, we can consider that the book is not so complex [can we really say that ??] [DO YOU HAVE ANY OTHER IDEAS THAT COULD ASSESS THE COMPLEXITY OF THE TEXT ?] -> talk about the uniqueness of the data : [NOT SURE WHAT TO MENTION HERE BUT MUST BE MENTIONNED]
The subsequent analysis is as follows. We first study the sentiment of the corpus by extracting each chapter’s average sentiment using a qualitative and a quantitative approach. Then, we focus on the similarity between terms, to study their context, and between chapters, to better understand their associated topics of interest. From the clustering of term similarities, we continue with topic modelling using two approaches. Finally, we end with unsupervised and supervised learning methods to represent terms and documents in a reduced number of dimensions.
Now that the content of the corpus is cleaned and that we explored its content, we proceed to its further analysis with first its sentiment analysis (i.e. opinion mining) which qualifies or quantifies the sentiment emerging from one text. To proceed to the sentiment analysis, we use two approaches. The first one uses qualifiers (i.e. dictionary-based) and the second one uses numerical values (i.e. value-based).
When analyzing the sentiment emerging from a document, we take care not to remove stop words, since they might appear in the sentiment dictionary and provide useful insight.
The dictionary-based sentiment analysis matches the tokens of each document to a reference dictionary and looks up each word’s polarity (i.e. its association with a sentiment). The dictionary matches terms to a positive, negative, neg_positive or neg_negative sentiment. For simplicity, we only consider the positive and negative sentiments in the rest of this analysis. Note that the sentiment is the average over the token values of the document.
The disadvantage of the dictionary-based sentiment analysis is that negated forms of words are not taken into consideration. For example, the sentence I don’t enjoy the show will be considered positive, because the method ignores the contraction don’t and only matches the word enjoy.
The below interactive graph shows for each chapter of the book the proportion of terms matched with a positive and negative sentiment. For example, chapter two Gender Neutral With Urinals is found to have 177 terms matched with a positive sentiment and 408 with a negative one.
Overall, positive and negative sentiments are found in all 16 chapters of the book although more terms are recognized as negative (3,596) than as positive (2,661) indicating that the frequency of negative terms is higher than the one of positive terms.
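The matching step behind these counts can be sketched as follows. Both the five-word polarity dictionary and the token list are made up for illustration; the report uses a full sentiment dictionary in R:

```python
from collections import Counter

# Toy polarity dictionary (invented; real dictionaries hold thousands of words).
polarity = {"joke": "positive", "equality": "positive",
            "bias": "negative", "invisible": "negative", "harsh": "negative"}

tokens = ["the", "harsh", "bias", "made", "equality", "invisible"]

# Count matched tokens per sentiment; unmatched tokens are simply ignored.
sentiment = Counter(polarity[t] for t in tokens if t in polarity)
print(sentiment)   # Counter({'negative': 3, 'positive': 1})
```

Summing these per-sentiment counts over each chapter yields exactly the positive/negative proportions plotted above.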
The valence shifters approach uses positive and negative sentiment scores (i.e. value-based) to extract the sentiment of a document. Here, we use two dictionaries, a polarized words dictionary where we find a list of terms communicating a positive or negative attitude and a valence-shifters dictionary which provides terms that alter or intensify the meaning of the polarized words.
The next table shows the first five words of the polarized words dictionary and their respective numerical scores.
| token | value |
|---|---|
| a plus | 1.00 |
| abandon | -0.75 |
| abandoned | -0.50 |
| abandoner | -0.25 |
| abandonment | -0.25 |
The next table shows the first five words of the valence-shifters dictionary and their respective numerical scores.
| token | value |
|---|---|
| absolutely | 2 |
| acute | 2 |
| acutely | 2 |
| ain’t | 1 |
| aint | 1 |
To proceed to the valence-shifters sentiment analysis, we first extract the sentences from the text and compute their sentiment values. Here, we do not assign weights to certain types of sentences (e.g. questions), since we believe that the sentence type does not have a particular influence on our analysis.
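The scoring idea can be sketched as follows. This is a simplified, sentimentr-style sketch under two assumptions we are making explicit: only the word immediately preceding a polarized word can shift it, and the sentence score is the shifted sum scaled by the square root of the word count. The two tiny dictionaries are made up:

```python
import math

polarity = {"enjoy": 0.5, "abandon": -0.75}     # toy polarized-word scores
shifters = {"don't": -1.0, "really": 2.0}       # toy shifters: negator, amplifier

def sentence_sentiment(words):
    # Multiply each polarized word's score by any immediately preceding
    # valence shifter, then scale the sum by sqrt(word count).
    total = 0.0
    for i, w in enumerate(words):
        if w in polarity:
            weight = shifters.get(words[i - 1], 1.0) if i > 0 else 1.0
            total += polarity[w] * weight
    return total / math.sqrt(len(words))

print(round(sentence_sentiment(["i", "don't", "enjoy", "the", "show"]), 3))
# -0.224
```

Unlike the plain dictionary approach, the negator flips the sign of enjoy, so “i don’t enjoy the show” now correctly scores negative.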
The below output displays the sentiment values of the first five sentences of the corpus, i.e. of chapter one, and indicates for each the sentiment emerging from its terms. Any value below -0.05 is considered negative, any value above 0.05 positive, and anything in between neutral. Thus, the first, fourth and fifth sentences are negative (-0.408, -0.296 and -0.535 respectively), and the second and third ones are positive (0.245 and 0.096 respectively). Note that because the word_count column had NAs, we remove rows with no available information, since no sentiment can be extracted from them.
| document | sentence_id | word_count | sentiment |
|---|---|---|---|
| Chapter 1 | 1 | 6 | -0.408 |
| Chapter 1 | 2 | 6 | 0.245 |
| Chapter 1 | 3 | 33 | 0.096 |
| Chapter 1 | 4 | 33 | -0.296 |
| Chapter 1 | 5 | 14 | -0.535 |
Since sentiment changes as sentences change, we zoom out to look at the sentiment score by chapter of the book. The ave_sentiment column gives the average sentiment score by chapter.
The next interactive graph displays the average sentiment scores by chapter in decreasing order. According to it, chapter 4 The Myth of Meritocracy has the highest positive average (0.323) and chapter 16 It’s Not The Disaster That Kills You has the most negative average sentiment. In total, five chapters have a positive sentiment (i.e. above 0.05), four have a negative sentiment (i.e. below -0.05) and seven have a neutral sentiment (i.e. between -0.05 and 0.05).
Similarity is a numerical value measuring proximity between terms (to see whether they are used in the same context) or between documents (to see whether they use the same terms). Note that similarity depends on the types of tokens found in the corpus: if the corpus draws mostly on a single vocabulary, most terms may appear similar.
To compute similarities, we use three different measures, namely the Jaccard index, cosine similarity and Euclidean distance, all three computed on the term frequency-inverse document frequency (TF-IDF) matrix.
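The three measures can be sketched side by side on two made-up TF-IDF vectors standing in for two chapters:

```python
import math

# Made-up TF-IDF vectors for two chapters over a 5-feature vocabulary.
u = [0.0, 1.2, 0.7, 0.0, 0.5]
v = [0.3, 0.9, 0.0, 0.0, 0.5]

# Jaccard on the sets of non-zero features (set model: each type counted once).
su, sv = {i for i, x in enumerate(u) if x}, {i for i, x in enumerate(v) if x}
jaccard = len(su & sv) / len(su | sv)

# Cosine similarity: cosine of the angle between the weighted vectors,
# independent of vector length.
dot = sum(a * b for a, b in zip(u, v))
cosine = dot / (math.hypot(*u) * math.hypot(*v))

# Euclidean distance (smaller = more similar, unlike the two above).
euclid = math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

print(round(jaccard, 2), round(cosine, 2), round(euclid, 2))   # 0.5 0.84 0.82
```

Note that the first two are similarities (higher means closer) while the Euclidean measure is a distance, which is why the three heatmaps below can paint rather different pictures of the same chapters.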
For all the clusterings, we decided to cut the tree at six clusters, since the book has six parts.
We first compute the similarity between chapters of the book to investigate whether they use the same token types.
We first compute the Jaccard index matrix displaying the relative number of common words using the TF-IDF matrix. Note that the Jaccard coefficient considers only once each token type (i.e. set model).
From the Jaccard similarity matrix, where similarities are based on the Jaccard coefficient (i.e. the relative number of common words) by document, we obtain the below heatmap showing which chapters are likely to use similar terms. A red square indicates strong similarity (e.g. 1) whereas a dark blue square indicates no similarity. The heatmap shows that chapter 15 Who Will Rebuild has the least similarity with the other chapters (scores closer to 0). We also observe that chapters 10 The Drugs Don’t Work and 11 Yentl Syndrome seem to use somewhat more similar terms.
Then, we compute the cosine similarity matrix, which measures the similarity between two vectors of an inner product space. Note that this similarity is independent of the vector lengths: only the cosine of the angle between the two weighted term-frequency vectors matters.
Compared to the heatmap generated using the Jaccard index, the next heatmap displays many more non-similarities (darker blue squares) between the terms used in different chapters. However, chapters 10 and 11 still stand out as sharing more similarity.
Finally, we compute the Euclidean-based similarity matrix using the Euclidean distance. The below heatmap shows again a different output. Indeed, less documents are shown as having no similarity and more chapter similarities stand in a middle in which we cannot infere on similarity. For example, chapter 15 Who Will Rebuild and 16 It’s Not The Disaster That Kills You do not seem to either show similarity or dissimilarity. On the contrary, chapters 12 A Costless Resource To Exploit and 15 Who Will Rebuild are slightly more using similar terms.
To cluster documents, we need to build the dissimilarity matrices and/or the Vector Space Model (VSM) on which we can apply the clustering methods: hierarchical clustering, based on distances, and K-means partitioning, based on features.
Clusters are difficult to interpret. This is why we look at the largest term frequencies of each cluster, to better understand the common denominator behind the grouping.
Hierarchical clustering is based on distances and is applied to the dissimilarities using the function hclust(). The hierarchical approach first assigns each document to its own cluster; then, at each iteration, the two most similar clusters are merged. The iterations continue until all chapters belong to a single cluster.
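A minimal sketch of this step, assuming `sim` is one of the similarity matrices computed earlier (the object names are illustrative, not the report's actual code):

```r
# Turn a similarity matrix into a dissimilarity (distance) object
diss <- as.dist(1 - as.matrix(sim))

# Agglomerative hierarchical clustering and its dendrogram
hc <- hclust(diss)
plot(hc)

# Cut the tree at six clusters, one per part of the book
clusters <- cutree(hc, k = 6)
split(names(clusters), clusters)   # chapters per cluster
```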
The inverted Jaccard dissimilarity matrix shows two main clusters: one groups chapters 15 Who Will Rebuild? and 16 It’s Not The Disaster That Kills You, and the other groups the rest of the book. Moreover, inside the second, larger cluster, we see two sub-clusters. From the dendrogram, we understand that chapters 10 The Drugs Don’t Work and 11 Yentl Syndrome are the most similar as their distance is the smallest (~0.76). On the contrary, chapters [CAN WE ADD WHICH CHAPTERS ARE THE MOST DISSIMILAR?]
The following table indicates to which cluster a chapter belongs.
| cluster1 | cluster2 | cluster3 | cluster4 | cluster5 | cluster6 |
|---|---|---|---|---|---|
| text1 , text2 , text3 , text7 , text12, text14 | text4 , text5 , text6 , text9 , text10, text11 | text8 | text13 | text15 | text16 |
For interpretation purposes, we extract the ten most frequent words of each cluster to identify a common denominator between chapters. According to the most used terms, cluster 1 groups terms about public transport or spaces, cluster 2 about medical or technological trials, and clusters 3 to 6 each contain a single chapter and therefore group terms according to their respective chapter’s vocabulary.
| Clust.1 | Clust.2 | Clust.3 | Clust.4 | Clust.5 | Clust.6 |
|---|---|---|---|---|---|
| transport | trial | keyboard | tax | rebuild | refugee |
| bus | drug | corpus | poverty | peace | violence |
| stove | dummy | voice | earner | orleans | shelter |
| interrupt | vr | pianist | marry | disaster | disaster |
| toilet | tech | algorithm | household | agreement | homeless |
| pedestrian | crash | recognition | file | miami | homelessness |
| travel | chemical | dataset | zombie | displace | conflict |
| party | pain | handspan | couple | fordham | ebola |
| plough | meritocracy | phone | income | gujarat | cyclone |
| agriculture | clinical | inch | youth | hurricane | camp |
The inverted cosine dissimilarity matrix seems to show again two main clusters, each containing two smaller sub-clusters. One main cluster groups chapters 10, 11, 5, 6, 8 and 9 and the other groups chapters 7, 1, 12, 15, 16, 13, 3, 2, 4 and 14. The dendrogram generated from the inverted cosine dissimilarity matrix indicates that, again, chapters 10 The Drugs Don’t Work and 11 Yentl Syndrome are the most similar as their distance is the smallest (~0.70). On the contrary, chapters [CAN WE ADD WHICH CHAPTERS ARE THE MOST DISSIMILAR?]
| cluster1 | cluster2 | cluster3 | cluster4 | cluster5 | cluster6 |
|---|---|---|---|---|---|
| text1 , text2 , text15, text16 | text3 , text12, text13 | text4 , text14 | text5, text6, text8, text9 | text7 | text10, text11 |
For interpretation purposes, we extract the ten most frequent terms of each cluster to identify a common denominator between chapters. According to the most used terms, cluster 1’s chapters share terms specific to public transport or spaces and violence, cluster 2’s chapters use terms regarding households, families and economics, cluster 3 reveals a political vocabulary, cluster 4 shows terms related to technology, cluster 5 displays terms about agriculture and, finally, cluster 6 indicates a medical vocabulary.
| Clust.1 | Clust.2 | Clust.3 | Clust.4 | Clust.5 | Clust.6 |
|---|---|---|---|---|---|
| transport | tax | interrupt | dummy | stove | trial |
| bus | pay | candidate | vr | plough | drug |
| toilet | gdp | party | crash | agriculture | clinical |
| pedestrian | poverty | meritocracy | chemical | stave | pain |
| travel | childcare | politician | ppe | farmer | cell |
| violence | household | election | boler | agricultural | blood |
| disaster | marry | hire | worker | crop | fda |
| snow | offer | mp | stoffregen | farm | heart |
| sánchez | week | aw | tech | strength | disease |
| madariaga | carer | teach | keyboard | doss | medication |
The Euclidean dissimilarity matrix seems to show a sequence of clusters, a completely different clustering pattern compared to the other two distance measures. This pattern seems to indicate that each chapter is rather independent from the others and does not share many similarities with the rest of the book. According to the Euclidean-based dissimilarity matrix, chapters 12 A Costless Resource to Exploit and 15 Who Will Rebuild? share the smallest distance, indicating that they are the most similar. This finding differs from the ones obtained with the two other distance measures. On the contrary, chapters [CAN WE ADD WHICH CHAPTERS ARE THE MOST DISSIMILAR?]
| cluster1 | cluster2 | cluster3 | cluster4 | cluster5 | cluster6 |
|---|---|---|---|---|---|
| text1 | text2 , text3 , text4 , text5 , text6 , text7 , text8 , text11, text12, text15, text16 | text9 | text10 | text13 | text14 |
For interpretation purposes, we extract the ten most frequent terms of each cluster to identify a common denominator between chapters. According to the most used terms, cluster 1 shows terms specific to transport and travel, cluster 2’s chapters use terms that are not necessarily specific to one topic, cluster 3 reveals terms related to technology, cluster 4 indicates a medical vocabulary, cluster 5 displays terms related to households and economics and, finally, cluster 6 shows politics-related terms.
| Clust.1 | Clust.2 | Clust.3 | Clust.4 | Clust.5 | Clust.6 |
|---|---|---|---|---|---|
| pedestrian | toilet | dummy | trial | tax | interrupt |
| transport | stove | vr | drug | poverty | party |
| snow | violence | crash | cell | earner | candidate |
| sánchez | worker | boler | clinical | marry | politician |
| madariaga | pay | stoffregen | fda | household | election |
| travel | bus | tech | nih | file | mp |
| de | chemical | motion | blood | zombie | aw |
| trip | girl | seat | medication | couple | representation |
| favela | plough | belt | adr | income | political |
| clear | meritocracy | tin | animal | youth | ambition |
K-means is a non-hierarchical partitioning method for clustering. In text analysis, K-means is based on feature frequencies. Here, we set the pre-defined number of clusters (i.e. the number of centroids) to six since there are six parts in the book. By doing so, we expect to see chapters grouped following the author’s division of the book into parts.
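A sketch of the partitioning step, assuming a hypothetical chapter-level TF-IDF dfm named `tfidf` (the seed is illustrative, since K-means starts from random centroids):

```r
# K-means on the document features with six centroids
set.seed(123)
km <- kmeans(as.matrix(tfidf), centers = 6, nstart = 25)

# Chapters grouped per cluster
split(docnames(tfidf), km$cluster)
```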
| cluster1 | cluster2 | cluster3 | cluster4 | cluster5 | cluster6 |
|---|---|---|---|---|---|
| text14 | text3 , text4 , text5 , text6 , text7 , text8 , text11, text12, text13, text15, text16 | text1 | text2 | text9 | text10 |
For interpretation purposes, we extract the ten most frequent terms of each cluster to identify a common denominator between chapters. According to the most used terms in each cluster, cluster 1 shows terms specific to politics, cluster 2’s chapters use terms that are not necessarily specific to one topic, cluster 3 reveals terms related to transportation and travel, cluster 4 reveals terms about public spaces and violence, cluster 5 displays terms related to technology and finally, cluster 6 indicates a medical vocabulary.
| Clust.1 | Clust.2 | Clust.3 | Clust.4 | Clust.5 | Clust.6 |
|---|---|---|---|---|---|
| interrupt | tax | pedestrian | toilet | dummy | trial |
| party | stove | transport | bus | vr | drug |
| candidate | pay | snow | transport | crash | cell |
| politician | worker | sánchez | girl | boler | clinical |
| election | violence | madariaga | transit | stoffregen | fda |
| mp | chemical | travel | urinal | tech | nih |
| aw | plough | de | harassment | motion | blood |
| representation | meritocracy | trip | loukaitou | seat | medication |
| political | disaster | favela | sideris | belt | adr |
| ambition | agriculture | clear | sexual | tin | animal |
Now, we analyze similarities between words across chapters (i.e. documents). A word is embedded in a Vector Space Model using its document frequencies (or weighted frequencies); this is why we compute similarities between terms using the DTM. To do so, we use the same three measures as when computing the similarities between documents. Note that, because of the large number of tokens, we focus on the highest word frequencies, i.e. the lowest word ranks (smaller than or equal to 40).
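Switching the similarity computation from documents to words can be sketched as follows (assuming a hypothetical chapter-level dfm named `dfmat`; `margin = "features"` is what moves the comparison from documents to terms):

```r
library(quanteda)
library(quanteda.textstats)

# Keep the 40 most frequent terms, then compute term-term similarities
top40 <- names(topfeatures(dfmat, 40))
dfm40 <- dfm_select(dfmat, pattern = top40)

word_sim <- textstat_simil(dfm40, method = "jaccard", margin = "features")
```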
The below heatmap shows that, according to the Jaccard index, most words among the 40 most frequent ones are very similar or identical. However, the Jaccard similarity indices between tax and the 39 other words are represented by dark blue squares, indicating that these words are not used in similar proportions across documents and are consequently not considered similar. Moreover, worker and pay are considered neither similar nor dissimilar according to the Jaccard index since their index is represented by a white square (similarity of 0.5).
The below heatmap shows that, according to the cosine similarity measure, some words are very similar (but not exactly identical). For example, the cosine similarity estimates of explain and mean, or of find and study, are represented by dark orange squares, indicating that the two words in each pair are used in similar proportions across documents. On the contrary, design and tax are found not to be used in similar proportions across documents (dark blue square), which indicates that these two terms are not considered similar.
[TO DO !!]
To avoid redundancy, we decide not to reuse the same similarity measures (Jaccard index, cosine similarity, Euclidean distance) as in the Clustering of Documents, and to use instead the co-occurrence similarity measure to cluster the 40 most frequent terms of the corpus studied above in Similarities Between Words.
Co-occurrence is also a similarity measure, for which two words are considered similar if they are often used close together in a similar context, in any part of the corpus. Therefore, co-occurrence similarity depends on the token’s context (i.e. document or window). Here, we consider the context as being a window of terms around a target (i.e. the central word or center). The subtlety of co-occurrence clustering is that, by considering the context of a word, we compute similarities beyond literal resemblance and also take the meaning of words into account. Thus, this measure differs from the Jaccard, cosine and Euclidean similarity measures, which do not depend on the context of a word. Note that since the context matters, the token order must be kept when computing the co-occurrence matrix (i.e. a Bag of Words object cannot be used here).
To proceed to the clustering, we first compute a symmetrical co-occurrence matrix on a window of 30 terms around a target used as a similarity matrix. The below heatmap shows the similarities of the most frequent terms.
Then, we transform the co-occurrence matrix into a dissimilarity matrix (maximum co-occurrence minus the similarity matrix), from which we derive the co-occurrence distance-based clustering (i.e. hierarchical clustering).
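The two steps above can be sketched with quanteda's feature co-occurrence matrix (assuming hypothetical objects `toks`, a tokens object preserving word order, and `top40`, the 40 most frequent terms):

```r
library(quanteda)

# Co-occurrence counts within a window of 30 tokens around each target;
# tri = FALSE keeps the matrix symmetric
co <- fcm(toks, context = "window", window = 30, tri = FALSE)

# Restrict to the 40 most frequent terms studied above
co40 <- fcm_select(co, pattern = top40)

# Dissimilarity = maximum co-occurrence minus the co-occurrence counts,
# then hierarchical clustering on the resulting distances
diss_co <- as.dist(max(co40) - as.matrix(co40))
plot(hclust(diss_co))
```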
[ADD INTERPRETAION: KEEP IN MIND THAT THE CO-OCCURRENCE MATRIX IS TURNED INTO A DISSIMILARITY MATRIX -> PROVIDE EXAMPLE WITH FEMAL-WOMAN]
From the above clustering of documents, we understand that clusters are formed by grouping chapters with similar terms or vocabularies, and that by looking at the most frequent terms in each cluster we can get an idea of what groups of chapters talk about. This is referred to as the topic of a document. Note that one document can treat several topics. Topics are associated with both documents and terms: a topic is associated with the documents that use similar terms related to this topic, and with the terms that appear in documents treating this topic.
Topic modeling is a type of statistical modeling for discovering the abstract topics that occur in a collection of documents. There are two approaches to the statistical modeling of topics, the Latent Semantic Analysis (LSA) model and the Latent Dirichlet Allocation (LDA) model.
Both models can be applied to the DTM or the TF-IDF matrix; here, we decide to proceed with topic modeling on the DTM.
LSA is a dimension-reduction method that decomposes the DTM into three sub-matrices around a pre-determined number of topics: \(\Sigma\) represents the strength of each topic, \(U\) expresses the document-topic similarity (i.e. the links between documents and each topic) and \(V\) expresses the term-topic similarity (i.e. the links between terms and each topic).
Here, the advantages, or goals, of LSA are to find and interpret topics within the chapters of Invisible Women and to reduce the dimension of the DTM by removing the sparsity that prevents us from inferring term associations with the chapters of the book. On the other hand, its disadvantage is the difficulty of interpreting its results.
To build the LSA object, we use the textmodel_lsa function of the quanteda.textmodels package and we set the number of dimensions to ten. Then, we proceed to the LSA decomposition into the three sub-matrices extracted from the DTM. The below output shows the document-topic matrix \(U\) for the first six chapters, the term-topic matrix \(V\) for the first ten terms of the vocabulary, and finally the topics’ strengths \(\Sigma\) in decreasing order.
The first matrix \(U\) shows that chapter 1 Can Snow-Clearing be Sexist? is the most positively associated with topic 8 and the most negatively associated with topic 6. The second matrix \(V\) indicates that sexist is the most positively associated with topic 8 and the most negatively associated with topic 3.
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7]
#> text1 -0.209 0.2395 -0.1092 0.33446 -0.1567 -0.3614 -0.2531
#> text2 -0.305 0.0861 -0.3325 0.59252 -0.2725 0.1033 -0.0911
#> text3 -0.360 0.5805 0.4191 -0.24507 0.0847 0.2455 -0.0717
#> text4 -0.264 -0.0931 -0.2497 -0.35944 -0.0117 -0.0363 -0.7081
#> text5 -0.225 -0.0538 -0.0532 0.01208 0.2880 0.0528 0.0510
#> text6 -0.158 0.1646 0.0784 -0.00563 0.1064 0.1134 -0.0199
#> [,8] [,9] [,10]
#> text1 0.6005 0.1464 -0.2439
#> text2 -0.2742 -0.2068 0.3639
#> text3 0.0781 0.0193 0.3845
#> text4 -0.3163 0.2092 -0.0388
#> text5 -0.0705 -0.2676 -0.4584
#> text6 -0.1354 -0.6208 -0.3141
#> [,1] [,2] [,3] [,4] [,5] [,6]
#> sexist -0.006851 -0.000880 -2.24e-02 -0.019373 -0.01295 -0.00206
#> joke -0.000954 0.000358 -2.30e-03 0.002624 0.00487 -0.00551
#> town -0.002891 0.007186 -6.16e-03 0.014802 -0.00174 -0.02203
#> karlskoga -0.001937 0.006828 -3.86e-03 0.012178 -0.00661 -0.01652
#> sweden -0.011540 0.027708 3.73e-05 0.004557 -0.01917 0.01537
#> hit -0.009287 0.009227 1.40e-02 0.011553 -0.00589 0.01135
#> initiative -0.004452 0.005432 -2.34e-03 -0.000788 0.00542 -0.01618
#> lens -0.000954 0.000358 -2.30e-03 0.002624 0.00487 -0.00551
#> harsh -0.000876 0.000835 -2.54e-03 -0.000182 -0.00142 -0.00364
#> glare -0.000732 0.003134 9.86e-05 0.001884 -0.00197 -0.00550
#> [,7] [,8] [,9] [,10]
#> sexist -0.017770 0.00728 0.00448 -2.01e-02
#> joke -0.000994 0.00801 0.00189 1.25e-03
#> town -0.013314 0.03864 0.00994 -1.29e-02
#> karlskoga -0.012320 0.03063 0.00805 -1.41e-02
#> sweden 0.004629 0.01272 0.00367 3.63e-02
#> hit 0.008635 -0.00434 0.00157 1.08e-05
#> initiative -0.003690 0.00210 0.00708 -2.37e-02
#> lens -0.000994 0.00801 0.00189 1.25e-03
#> harsh -0.009358 0.00290 0.00391 -3.27e-03
#> glare -0.002062 0.00670 0.00154 -5.42e-03
#> [1] 539.6 175.4 141.6 137.3 118.5 109.4 102.7 98.0 90.9 86.4
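The decomposition shown above can be sketched as follows (assuming a hypothetical chapter-level dfm named `dfmat`; `nd = 10` matches the ten dimensions used here):

```r
library(quanteda.textmodels)

# LSA (truncated SVD) of the DTM with ten dimensions
lsa <- textmodel_lsa(dfmat, nd = 10)

head(lsa$docs)       # U: document-topic matrix
head(lsa$features)   # V: term-topic matrix
lsa$sk               # Sigma: strength of each topic (singular values)
```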
Before delving into the topic modeling analysis, we start by looking at the first dimension of the LSA, since it is often correlated with document length. The following plot shows a negative linear relationship between the first component and the document length (i.e. the total number of tokens in a document), indicating that the lengthier a document (i.e. chapter) is, the less it is associated with dimension 1. For this specific reason, we ignore the first component of the LSA in the further topic analysis.
Then, to interpret the topics (i.e. dimensions) of the LSA, we look at the ten terms with the largest values and the ten terms with the lowest values. We arbitrarily decide to take a look at dimensions (i.e. topics) 4 and 5.
According to the below output, topic 4 is positively associated with public, transport, toilet, woman, space, bus, travel, sexual, report, girl and negatively associated with politician, party, government, country, candidate, leave, pay, bias, male, female. Therefore, chapters associated with topic 4 use the first ten terms more and the last ten terms less. Consequently, chapters strongly associated with component 4 are likely to talk about female experiences in public spaces.
#> public transport toilet woman space bus
#> 0.2368 0.1976 0.1732 0.1674 0.1527 0.1501
#> travel sexual report girl politician party
#> 0.1055 0.1050 0.1042 0.0992 -0.0836 -0.0878
#> government country candidate leave pay bias
#> -0.0891 -0.0935 -0.0936 -0.0991 -0.1237 -0.1252
#> male female
#> -0.1620 -0.3852
According to the below output, topic 5 is positively associated with test, dummy, body, car, crash, tech, datum, seat, vr, design and negatively associated with tax, government, transport, trial, find, female, drug, sex, gender, public. Therefore, documents associated with topic 5 use the first ten terms more and the last ten terms less. Consequently, chapters strongly associated with component 5 are likely to talk about technology.
#> test dummy body car crash tech
#> 0.2104 0.1858 0.1847 0.1613 0.1542 0.1404
#> datum seat vr design tax government
#> 0.1274 0.1250 0.1239 0.1229 -0.0940 -0.0944
#> transport trial find female drug sex
#> -0.0991 -0.1008 -0.1058 -0.1077 -0.1123 -0.1130
#> gender public
#> -0.1327 -0.1522
Now, to connect the topics to the documents and to the terms, we generate a biplot that associates the positions of the words and of the chapters in the LSA space, for dimensions 4 and 5 detailed above. To avoid readability issues due to the large number of terms, we display only the words that are the most associated with these dimensions.
Here, we see that topic 4 is associated with chapters 1, 2, 11, 15, 16 and the terms woman, report, sexual, travel, bus, space, toilet, transport, public and anti-associated with chapters 3, 4, 12, 13, 14 and the terms male, pay, leave, politician, bias, country, etc. Topic 5 is associated with chapters 5, 6, 7, 8, 9 and 15 and the terms test, dummy, body, car, crash, tech, datum and anti-associated with chapter 10 and the terms trial, drug, sex.
Latent Dirichlet Allocation is a generative model (i.e. a Bayesian/probabilistic model), meaning that it generates documents (i.e. BoW) with a pre-defined number of topics, as with LSA. The model works as follows: random proportions of topics are drawn for a document; for each word of the document, a topic is selected at random; then, given this randomly selected topic, a word of its vocabulary is also selected at random. Note that the main disadvantage of LDA is the difficulty of interpreting its results, while its advantage is that we obtain probabilities in addition to the term-topic assignment. To interpret its results, the model is represented as a set of conditional probabilities defined by parameters fitted to the DTM using a maximum-likelihood approach.
To build the LDA object, we use the textmodel_lda function of the seededlda package and we set the number of topics to ten.
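A sketch of the call (assuming a hypothetical chapter-level dfm named `dfmat`; LDA is stochastic, so the seed is illustrative):

```r
library(seededlda)

# LDA with ten topics on the DTM
set.seed(1234)
lda <- textmodel_lda(dfmat, k = 10)

terms(lda, n = 3)   # top three terms per topic
topics(lda)         # most likely topic per chapter
```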
After building the LDA object, we look at the top three terms per topic and at the main topics per document. The following output shows that violence, disaster and rape are the top three terms of topic 1, while public, transport and space are the top terms of topic 5. The second table counts, for each topic, the number of chapters for which it is the main topic: topic 3 is the main topic of three chapters, whereas topics 2, 4, 7, 8 and 9 each dominate a single chapter.
#> topic1 topic2 topic3 topic4 topic5 topic6
#> [1,] "violence" "hand" "pay" "test" "public" "woman"
#> [2,] "disaster" "voice" "unpaid" "body" "transport" "datum"
#> [3,] "rape" "algorithm" "hour" "car" "space" "male"
#> topic7 topic8 topic9 topic10
#> [1,] "bias" "female" "worker" "sex"
#> [2,] "student" "government" "stove" "drug"
#> [3,] "teach" "mp" "body" "study"
#> .
#> topic1 topic2 topic3 topic4 topic5 topic6 topic7 topic8
#> 2 1 3 1 2 2 1 1
#> topic9 topic10
#> 1 2
The term-topic analysis computes the conditional probability \(\phi\) of finding a term knowing that it is assigned to a given topic. Then, for a given topic, the largest conditional probabilities give the terms that are the most associated with this topic.
The below output shows, for each topic, the terms with the highest conditional probabilities \(\phi\). For example, the conditional probability of finding woman knowing that this term is assigned to topic 10 is 1, which indicates that woman is very strongly associated with topic 10 and, consequently, that any document associated with topic 10 will contain the term woman. Moreover, we also see that some topics are better defined than others and that some topics can overlap. For example, topic 10 is the best-defined topic whereas topic 2 is not well defined at all. The overlapping is probably linked to the cleaning process of the corpus.
The topic-document analysis computes the conditional probability \(\theta\) of finding a topic knowing the document. Then, for a given document, the largest conditional probabilities give the topics that are the most associated with this document.
The below output shows, for the six longest chapters, the proportion of each topic in the document (i.e. the conditional probabilities \(\theta\)). For example, chapter 2 Gender Neutral With Urinals mainly talks about topics 1 and 10 (though not in the same proportion).
The prevalence is the topic distribution (proportion) in a corpus:
\(Prev(topic\ k) = \frac{1}{M}\Sigma_{m=1}^M \theta(k,m)\)
The following output displays the prevalence scores for each topic. We see that topic 6 is the most prevalent topic in the corpus (6.312) and that topic 4 is the least prevalent (0.633).
#> topic1 topic2 topic3 topic4 topic5 topic6 topic7 topic8
#> 1.240 0.647 2.235 0.633 1.186 6.312 0.647 0.700
#> topic9 topic10
#> 1.272 1.128
Topic modeling allows us to organize, understand and summarize large corpora. Yet, it has limitations, especially regarding the interpretation of its outcomes. Other measures provide a way to extract further insight.
The measure of coherence allows us to assess the quality of a topic (i.e. good versus bad):
\(C = \Sigma_{t=2}^{T} \Sigma_{t'=1}^{t-1}\log\left(\frac{D(v_t,v_{t'}) + 1}{D(v_{t'})}\right)\)
The below output gives the coherence of each of the ten topics. We see that the most coherent topic is topic 6 (0.428) and the least coherent is topic 4 (-9.492). A high coherence indicates that the words most associated with a topic (large \(\phi\)) are used in the same documents.
#> topic1 topic2 topic3 topic4 topic5 topic6 topic7 topic8
#> -3.365 -3.644 -2.684 -9.492 -6.122 0.428 -3.876 -8.080
#> topic9 topic10
#> -2.358 -4.306
To verify the above statements, we take a look at the co-document frequencies. Below, we compare the co-document frequency matrices of topics 8 (less coherent) and 10 (more coherent). The matrices for topic 8 (top) and 10 (bottom) show that the top five terms of topic 10 co-occur in the same documents more often than the top five terms of topic 8.
#> features
#> features female government mp election party
#> female 15 13 3 3 2
#> government 13 13 3 3 2
#> mp 3 3 2 3 2
#> election 3 3 3 2 2
#> party 2 2 2 2 2
#> features
#> features sex drug study trial heart
#> sex 14 3 15 2 9
#> drug 3 2 3 2 3
#> study 15 3 14 2 9
#> trial 2 2 2 2 2
#> heart 9 3 9 2 5
A topic is considered exclusive if it is associated with terms that are not associated with another topic:
\(Excl(topic\ k) = \Sigma_{t=1}^T\left(\frac{\phi(v_t,k)}{\Sigma_{k'=1}^K \phi(v_t,k')}\right)\)
The next output shows that the most exclusive topic, i.e. the one whose terms are least associated with other topics, is topic 10 (0.824), meaning that its top terms are specific to it. The least exclusive topic is topic 8 (0.00191).
#> topic1 topic2 topic3 topic4 topic5 topic6 topic7 topic8
#> 0.00229 0.00475 0.00608 0.10608 0.00344 0.05302 0.00939 0.00191
#> topic9 topic10
#> 0.05601 0.82406
Embedding refers to the representation of elements (documents or tokens) in a Vector Space Model (VSM). In the previous sections, tokens and documents were embedded in the DTM and TF-IDF matrices. The difference here is that we aim at representing documents and terms beyond BoW models, using term co-occurrences instead (i.e. order matters).
This section proceeds to embed words based on co-occurrences using the GloVe model. The idea is to reflect co-occurrences and not only documents (BoW). Here, each term is assigned a vector and we check whether the similarity of the word vectors matches the words’ co-occurrence. Two terms are considered close if their co-occurrence is large.
The GloVe model defines two vectorial representations for one word: one for the contextualized word and one for the central term. The two representations are then combined into a single one. To build the GloVe model, we use the GlobalVectors function of the text2vec package.
To start with, we compute a symmetric feature co-occurrence matrix with a window of five (i.e. the context window). The matrix is very large (8,640 × 8,640) and displays the term co-occurrences within the pre-specified window.
From the feature co-occurrence matrix, we compute two vector representations for a given word, one for the central term and the other for its context. To obtain a unique representation, we average the two. Here, we set a two-dimensional representation (rank = 2) for readability and we plot the vectors of the 20 most frequent terms. The plot shows that dimension 2 is strongly associated with woman and anti-associated with include, and that dimension 1 is associated with pay, increase, design, country and anti-associated with datum.
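The fitting step can be sketched as follows (assuming a hypothetical co-occurrence matrix `co5` built with a window of five; the seed and iteration count are illustrative):

```r
library(text2vec)

# GloVe fitted on the feature co-occurrence matrix;
# rank = 2 keeps the representation two-dimensional for plotting
set.seed(42)
glove <- GlobalVectors$new(rank = 2, x_max = 10)
wv_main <- glove$fit_transform(co5, n_iter = 20)

# Combine the central-word and context-word representations into one
word_vectors <- wv_main + t(glove$components)
```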
Document embedding builds on word embedding by translating word representations into document representations. To do so, we compute the centroids of the documents using two methods: centroids from averaging based on the DTM, and weighted centroids based on TF-IDF.
Before generating the centroids, we extract the words of each document, retrieve their word vectors and gather them in a matrix. Then, we average all these vectors.
The centroid method averages the word vectors over the number of words in the document:
\(vec(d) = \frac{1}{|d|} \Sigma_{w \in d} vec(w)\)
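The averaging can be sketched as follows (assuming hypothetical objects `toks`, a tokens object with one element per chapter, and `word_vectors`, a matrix of word embeddings with terms as row names):

```r
# Average the word vectors of a document to obtain its centroid,
# keeping only terms present in the embedding vocabulary
doc_centroid <- function(doc_terms, wv) {
  in_vocab <- intersect(doc_terms, rownames(wv))
  colMeans(wv[in_vocab, , drop = FALSE])
}

# One row per chapter, one column per embedding dimension
doc_emb <- t(sapply(as.list(toks), doc_centroid, wv = word_vectors))
```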
[INTERPRET]
The weighted centroid method uses TF-IDF:
\(vec(d) = \frac{\Sigma_{w \in d}tfidf(w,d)\,vec(w)}{\Sigma_{w \in d}tfidf(w,d)}\)
[ADD INTERPRETATION]
The document embedding can be used for clustering (using K-means) or for computing the similarity/dissimilarity matrices (Jaccard, cosine, Euclidean) before hierarchical clustering. An alternative approach is to use the Relaxed Word Mover’s Distance (RWMD) to compute document similarity from word embeddings. RWMD can use the DTM or TF-IDF.
Distances (i.e. similarities) between documents in the RWMD approach estimate how hard it is to transform words from one document into words from another document and vice versa.
To compute the RWMD, we use the RelaxedWordMoversDistance function from the text2vec package with the DTM (word and document embedding). Note that the function provides similarities and distances, which allows us to plot a dendrogram displaying the clustering of chapters.
The following dendrogram shows that chapter 10 The Drugs Don’t Work and chapter 11 Yentl Syndrome, from part IV Going to the Doctor, are the most similar since the height at which they are joined (~0.14) is the smallest. In the RWMD view, this means that the cost of converting the words of chapter 10 into the words of chapter 11 is the smallest because the distance between their word embeddings is the smallest.
#> INFO [12:43:50.602] epoch 1, loss 0.0470
#> INFO [12:43:50.752] epoch 2, loss 0.0274
#> INFO [12:43:50.920] epoch 3, loss 0.0217
#> INFO [12:43:51.049] epoch 4, loss 0.0183
#> INFO [12:43:51.214] epoch 5, loss 0.0160
#> INFO [12:43:51.366] epoch 6, loss 0.0143
#> INFO [12:43:51.545] epoch 7, loss 0.0130
#> INFO [12:43:51.694] epoch 8, loss 0.0119
#> INFO [12:43:51.907] epoch 9, loss 0.0111
#> INFO [12:43:52.083] epoch 10, loss 0.0104
Since RWMD generates dissimilarities between documents, we can easily obtain a clustering of the RWMD document embedding. The positions of vectors here mirror the Relaxed Word Mover’s Distances.
[ADD INTERPRETAION]
#> INFO [12:44:14.530] epoch 1, loss 0.0097
#> INFO [12:44:14.538] epoch 2, loss 0.0052
#> INFO [12:44:14.548] epoch 3, loss 0.0031
#> INFO [12:44:14.561] epoch 4, loss 0.0020
#> INFO [12:44:14.568] epoch 5, loss 0.0015
#> INFO [12:44:14.575] epoch 6, loss 0.0012
#> INFO [12:44:14.581] epoch 7, loss 0.0010
#> INFO [12:44:14.588] epoch 8, loss 0.0009
#> INFO [12:44:14.597] epoch 9, loss 0.0008
#> INFO [12:44:14.604] epoch 10, loss 0.0008
The goal of the supervised analysis is to re-classify the 256 pages of the book into its six parts. As our data set is small, we adopt a machine-learning approach consisting of splitting the corpus into a training set (80%) and a test set (20%), combined with bootstrap cross-validation to avoid a bias due to the splitting. We then train the classifiers on the training set and finally select the best classifier based on the results on the test set.
To proceed to the classification, the corpus must be cleaned so that the features (i.e. terms) are usable. Note that since terms must be meaningful, we do not apply stemming to the features. The appropriate cleaning process consists of sequential steps, from tokenization to removing useless words (i.e. stop words) and lemmatization.
Random forest is a supervised learning algorithm used in regression and classification problems. It consists of building many decision trees whose predictions are averaged to make a final prediction. Here, we classify pages into the six parts of the book:
Part I: Daily Life
Part II: The Workplace
Part III: Design
Part IV: Going to the Doctor
Part V: Public Life
Part VI: When it Goes Wrong
First, to obtain a set of structured features, we generate a DTM of dimensions 256 by 5,738 with term frequencies. Since the DTM is relatively large, we reduce its dimension using Latent Semantic Analysis (LSA) with a target of 30 dimensions. Note that we already removed the stop words during cleaning and previously made sure that they are unimportant to our analysis.
In this case, LSA has the advantage of decreasing the number of features to be used while keeping relevant information for the analysis.
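LSA amounts to a truncated singular value decomposition of the document-term matrix. A base-R sketch on a toy matrix (the real DTM is 256 by 5,738 reduced to 30 dimensions; the matrix and `k` below are purely illustrative):

```r
# LSA sketch: project a toy 3-docs x 4-terms count matrix onto k
# latent dimensions via a truncated SVD.
dtm <- matrix(c(2, 0, 1, 0,
                1, 3, 0, 1,
                0, 1, 2, 2), nrow = 3, byrow = TRUE)
k   <- 2                                       # target latent dimensions
dec <- svd(dtm)
lsa_docs <- dec$u[, 1:k] %*% diag(dec$d[1:k])  # documents in latent space
dim(lsa_docs)                                  # one row per document, k columns
```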
From the results of LSA, we fit the classifiers on the training set and predict on the test set. The results below first show the overall quality of the predictions (on the test set) of the random forest classifier, through its confusion matrix.
The diagonal shows the pages correctly predicted in their actual parts of the book; the off-diagonal cells show the misclassifications. For example, part II, The Workplace, shows the most confusion: 13 pages are correctly classified, while 5 pages from other parts are incorrectly predicted as part II.
The overall statistics show an accuracy (i.e. proportion of correct predictions) of 83.7%, indicating that our model performs well. Interestingly, even after reducing the DTM to 30 dimensions, we still obtain a high accuracy.
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 1 2 3 4 5 6
#> 1 5 0 0 0 0 0
#> 2 1 13 2 0 1 1
#> 3 0 2 8 1 0 0
#> 4 0 0 0 6 0 0
#> 5 0 0 0 0 6 0
#> 6 0 0 0 0 0 3
#>
#> Overall Statistics
#>
#> Accuracy : 0.837
#> 95% CI : (0.703, 0.927)
#> No Information Rate : 0.306
#> P-Value [Acc > NIR] : 2.21e-14
#>
#> Kappa : 0.793
#>
#> Mcnemar's Test P-Value : NA
#>
#> Statistics by Class:
#>
#> Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
#> Sensitivity 0.833 0.867 0.800 0.857 0.857
#> Specificity 1.000 0.853 0.923 1.000 1.000
#> Pos Pred Value 1.000 0.722 0.727 1.000 1.000
#> Neg Pred Value 0.977 0.935 0.947 0.977 0.977
#> Prevalence 0.122 0.306 0.204 0.143 0.143
#> Detection Rate 0.102 0.265 0.163 0.122 0.122
#> Detection Prevalence 0.102 0.367 0.224 0.122 0.122
#> Balanced Accuracy 0.917 0.860 0.862 0.929 0.929
#> Class: 6
#> Sensitivity 0.7500
#> Specificity 1.0000
#> Pos Pred Value 1.0000
#> Neg Pred Value 0.9783
#> Prevalence 0.0816
#> Detection Rate 0.0612
#> Detection Prevalence 0.0612
#> Balanced Accuracy 0.8750
Now, we can consider tuning the elements that compose the features. For this purpose, we reconsider the number of LSA dimensions and select it based on the best accuracy.
#> [1] 0.163 0.592 0.796 0.796 0.796 0.714 0.571
The plot below shows that the greatest accuracy is obtained when we set the LSA to 25 dimensions.
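The structure of this tuning step can be sketched as a loop over candidate dimensions that keeps the one with the best test accuracy. The candidate grid and the accuracy values below are mock placeholders (the real values come from refitting the model for each dimension):

```r
# Tuning sketch: for each candidate number of LSA dimensions, refit
# the classifier and record the test accuracy, then keep the best.
dims     <- c(10, 15, 20, 25, 30)              # hypothetical candidate grid
mock_acc <- c(0.60, 0.75, 0.80, 0.84, 0.80)    # stand-in accuracies

# In the real analysis this would reduce the DTM to k dimensions,
# refit the random forest, and return the test-set accuracy.
fit_and_score <- function(k) mock_acc[match(k, dims)]

acc      <- vapply(dims, fit_and_score, numeric(1))
best_dim <- dims[which.max(acc)]               # dimension with best accuracy
```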
Another way to check whether the accuracy can be improved is to fit the model using the TF-IDF weighting instead of the raw term-frequency DTM. Here we reduce the dimension of the TF-IDF matrix using Latent Semantic Analysis (LSA), now targeting 25 dimensions.
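The TF-IDF weighting itself can be sketched in base R: each term frequency is scaled by the inverse document frequency, so terms that appear on every page are down-weighted. The 3 by 4 count matrix below is a toy illustration:

```r
# TF-IDF sketch: tf is a toy 3-docs x 4-terms count matrix.
tf  <- matrix(c(2, 0, 1, 0,
                1, 3, 0, 1,
                0, 1, 2, 2), nrow = 3, byrow = TRUE)
df  <- colSums(tf > 0)                   # docs in which each term appears
idf <- log(nrow(tf) / df)                # inverse document frequency
tfidf <- tf * rep(idf, each = nrow(tf))  # scale each column by its idf
```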
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 1 2 3 4 5 6
#> 1 6 1 0 0 0 0
#> 2 0 12 1 0 0 0
#> 3 0 2 9 0 0 0
#> 4 0 0 0 7 0 0
#> 5 0 0 0 0 7 0
#> 6 0 0 0 0 0 4
#>
#> Overall Statistics
#>
#> Accuracy : 0.918
#> 95% CI : (0.804, 0.977)
#> No Information Rate : 0.306
#> P-Value [Acc > NIR] : <2e-16
#>
#> Kappa : 0.899
#>
#> Mcnemar's Test P-Value : NA
#>
#> Statistics by Class:
#>
#> Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
#> Sensitivity 1.000 0.800 0.900 1.000 1.000
#> Specificity 0.977 0.971 0.949 1.000 1.000
#> Pos Pred Value 0.857 0.923 0.818 1.000 1.000
#> Neg Pred Value 1.000 0.917 0.974 1.000 1.000
#> Prevalence 0.122 0.306 0.204 0.143 0.143
#> Detection Rate 0.122 0.245 0.184 0.143 0.143
#> Detection Prevalence 0.143 0.265 0.224 0.143 0.143
#> Balanced Accuracy 0.988 0.885 0.924 1.000 1.000
#> Class: 6
#> Sensitivity 1.0000
#> Specificity 1.0000
#> Pos Pred Value 1.0000
#> Neg Pred Value 1.0000
#> Prevalence 0.0816
#> Detection Rate 0.0816
#> Detection Prevalence 0.0816
#> Balanced Accuracy 1.0000
As expected, using the TF-IDF and an optimized number of dimensions increases the accuracy from 83.7% to 91.8%. The tuning thus significantly improved the classification.
Here, we try to improve the classification further by adding a feature based on the lengths of the tokens, which raises the accuracy to 93.9%.
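A sketch of appending such a feature to the LSA features before refitting. One plausible version of the feature, used here, is the average token length per page; `lsa_feats` and `page_tokens` are illustrative stand-ins for the objects built earlier:

```r
# Append a token-length feature as an extra column of the feature
# matrix. Three toy "pages" with 25 LSA dimensions each.
lsa_feats   <- matrix(rnorm(3 * 25), nrow = 3)   # 25 LSA dims per page
page_tokens <- list(c("data", "bias"),           # tokens of each page
                    c("women", "invisible", "design"),
                    c("work", "care"))
avg_len <- vapply(page_tokens, function(t) mean(nchar(t)), numeric(1))
feats   <- cbind(lsa_feats, avg_token_len = avg_len)  # 26 features per page
```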
The SVM method is a machine learning method that is very different from random forest. SVM classifies by separating the data points of different classes with the widest possible margin; in our case, the SVM is used to predict multiple classes, not only two.
To compare with the prediction quality of the random forest model, we classify the pages into parts of the book using a Support Vector Machine learner. To do so, we use the svm function of the e1071 package with a radial kernel and default parameters.
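A sketch of that call on toy data standing in for the LSA features (sizes and names are illustrative; the kernel and default parameters match the description above):

```r
# Illustrative SVM fit with e1071::svm, radial kernel, default cost
# and gamma, on 60 toy "pages" with 5 features each.
library(e1071)

set.seed(1)
x <- matrix(rnorm(60 * 5), nrow = 60)          # toy feature matrix
y <- factor(sample(1:6, 60, replace = TRUE))   # part labels 1-6
fit  <- svm(x, y, kernel = "radial")           # default parameters
pred <- predict(fit, x)                        # predicted parts
```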
The following results show that the SVM model with 25 LSA dimensions and TF-IDF has an accuracy of 87.8%, which is lower than that of the random forest model with the same parameters (91.8%). Therefore, we do not continue with this model and keep the random forest.
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 1 2 3 4 5 6
#> 1 6 0 0 0 0 0
#> 2 0 12 2 0 1 0
#> 3 0 1 8 0 0 0
#> 4 0 0 0 7 0 0
#> 5 0 1 0 0 6 0
#> 6 0 1 0 0 0 4
#>
#> Overall Statistics
#>
#> Accuracy : 0.878
#> 95% CI : (0.752, 0.954)
#> No Information Rate : 0.306
#> P-Value [Acc > NIR] : <2e-16
#>
#> Kappa : 0.848
#>
#> Mcnemar's Test P-Value : NA
#>
#> Statistics by Class:
#>
#> Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
#> Sensitivity 1.000 0.800 0.800 1.000 0.857
#> Specificity 1.000 0.912 0.974 1.000 0.976
#> Pos Pred Value 1.000 0.800 0.889 1.000 0.857
#> Neg Pred Value 1.000 0.912 0.950 1.000 0.976
#> Prevalence 0.122 0.306 0.204 0.143 0.143
#> Detection Rate 0.122 0.245 0.163 0.143 0.122
#> Detection Prevalence 0.122 0.306 0.184 0.143 0.143
#> Balanced Accuracy 1.000 0.856 0.887 1.000 0.917
#> Class: 6
#> Sensitivity 1.0000
#> Specificity 0.9778
#> Pos Pred Value 0.8000
#> Neg Pred Value 1.0000
#> Prevalence 0.0816
#> Detection Rate 0.0816
#> Detection Prevalence 0.1020
#> Balanced Accuracy 0.9889
From our exploratory data analysis, …
The sentiment analysis shows that …
The topic modeling shows that …
To perform the unsupervised learning analysis (embedding), we represent documents and words in a vector space model (VSM). -> add results: + would the similarities and clustering results also fit here?
To perform the supervised learning analysis (here, a classification task), we split the corpus into a training and a test set and explore two classification models: random forest and support vector machine (SVM). We find that the random forest model provides a much higher overall accuracy [ADD HERE THE HIGHEST OVERALL ACCURACY ONCE WE KNOW IT FOR SURE IN THE ANALYSIS] than the SVM when classifying the pages of the book into their respective parts. [ADD HERE IF WE OBSERVE IMPROVEMENT WHEN ADDING FEATURES].
Thus, our final random forest model using [TFIDF OR DTM: WE FIRST NEED TO FINISH THIS PART IN ANALYSIS] is capable of identifying to which part of Invisible Women a page belongs with ?% accuracy.
To conclude, we find that chapters of the book can share a similar vocabulary while approaching very different topics, from medical to technological… + sentiment of the book …